1. INTRODUCTION

Autonomous vehicles are a prominent research topic in computer vision. To make a driving decision, the
vehicle must measure the three-dimensional (3D) structure of its surroundings accurately and in real time, so
the precision of the depth map is crucial for safety. Depth is usually acquired with hardware such as light
detection and ranging sensors. These sensors are expensive to install, have drawbacks that may lower the
quality of the depth information, and do not provide additional cues such as traffic light color, which plays
a major role in decision making. Computer vision based stereo matching is an alternative that overcomes
these drawbacks. The aim of stereo matching is to find matching pixels in images taken from different
viewpoints and then estimate the depth [1]-[3]. It finds applications in augmented reality, robotics, and
3D reconstruction [4]-[9]. Stereo vision imitates the way the human eyes and brain perceive depth. A scene
captured by two horizontally displaced cameras forms two slightly different projections. Disparity is the
horizontal displacement of an object between the two projections. A map that contains the displacement of
every pixel in an image is known as a disparity map, and the depth of the scene can be estimated from it.
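
The link between disparity and depth can be made concrete for a rectified pair. The following is a minimal sketch, assuming the standard pinhole relation Z = f * B / d, with the focal length f in pixels and the baseline B in metres. The function name and the numeric values are illustrative and do not come from this paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert one disparity value (pixels) to depth (metres) via Z = f * B / d.

    Assumes a rectified stereo pair. Returns None for non-positive disparity,
    which corresponds to an invalid match or a point at infinity.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# Illustrative values only: 700 px focal length, 0.5 m baseline.
row = [70.0, 35.0, 0.0]
depths = [disparity_to_depth(d, focal_px=700.0, baseline_m=0.5) for d in row]
print(depths)  # [5.0, 10.0, None]
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy for distant objects than for near ones.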

Many stereo algorithms have been proposed in the recent past [10]-[12]. A classic stereo algorithm
follows three steps: computing pixel-wise features, constructing the cost volume, and post-processing.
Traditional stereo matching methods are grouped into local, global, and semi-global methods. Local
methods rely on low-level pixel features to compute similarity in the cost computation step and estimate
the correspondence over a window or support region [13]-[15]. Because the pixel-wise characterization
is a major factor, researchers have used a wide variety of representations, from simple RGB values to
descriptors such as the census transform and the scale-invariant feature transform.
A segment-based superpixel technique is proposed in [16]: after finding the edges and the matching cost, it
applies adaptive support weights in cost aggregation and a dual-path refinement to correct disparities. Stereo
matching based on adaptive cross area and guided filtering with orthogonal weights (ACR-GIF-OW) is
proposed in [17]. These techniques are computationally inexpensive but do not produce accurate results in
textureless, discontinuous, and occluded areas.

Global methods handle textureless regions and uneven surfaces by including a smoothness cost.
They treat matching as a labelling problem in which pixels are nodes and the estimated disparities are
labels, and they compute the disparity by iteratively minimizing a global energy function composed of a
data term and a smoothness term. Graph cut [18], dynamic programming [19], and belief
propagation [20] are the classic global matching algorithms. A tree structure named pyramid-tree is
proposed in [21]; it performs cross-regional smoothing and handles regions of low texture. In addition, it
uses the log-angle for cost computation, which is robust to inconsistencies. The accuracy of global methods
is limited because these approaches depend on hand-crafted features.

Convolutional neural networks (CNNs) are popular in many vision applications [22]-[24] and are
widely used in stereo matching, where they improve performance over traditional methods. Kendall
et al. [25] presented an architecture that learns disparity without regularization. Features are extracted
automatically by the CNN without manual intervention and are then used for stereo matching, which
helps in textureless regions and on uneven surfaces. Eigen et al. [26] used basic neural networks to
determine the depth of a scene: an AlexNet architecture generates a coarse map, and a second network
performs local refinements. The work in [27] proposed a multi-stage framework that combines random
forests and CNNs. Its architecture, named neural regression forest, finds depth from a single input image
and allows all CNNs to be trained in parallel; a bilateral filter then produces a refined disparity map. A
similar concept is presented in [28], where many tiny neural networks are trained across overlapping
patches. DispNet is one of the basic networks used for disparity estimation. A cascading residual learning
network that extends the DispNet structure is used in [29]. It is built from DispFullNet and DispResNet:
the first stage uses DispNet with an additional up-convolution module, which helps to extract more
information, and the next stage generates a residual signal that helps in refinement. A trainable network is
described in [30]. It uses a robust differentiable patch match internal structure that discards most
disparities without fully evaluating the cost volume, which reduces the search space and improves memory
and time efficiency. The main drawback of existing methods is that ill-posed regions are not handled
effectively. In the proposed method, a CNN is combined with an optimization technique. The hand-crafted
term is replaced with features learned by the CNN, and the output of the CNN is used to calculate the unary
and smoothness costs. The smoothness cost takes information from neighboring pixels and estimates
contrast-sensitive information to obtain a smooth disparity map. Post-processing is performed to handle
occlusion.

In stereo vision, areas visible in one view may not be visible in the other, and it is often difficult to
reconstruct such regions in one image by looking at the other. The losses computed in these areas are noisy
and lead to inaccurate results, specifically in occluded areas. Disparity refinement is applied to enhance
the matching accuracy in ill-posed areas, and the left-right consistency check is the common method for
identifying and handling outliers. Even though several methods have been proposed to improve matching,
low accuracy in ill-posed areas has not been handled well. To handle these areas, we perform
post-processing with a generative adversarial network (GAN), a model put forward by Goodfellow [31].
A GAN is a framework for training generative models based on a min-max game. Two models, a generator
and a discriminator, are used to analyze the distribution of the data: the generator tries to learn a
distribution that is almost the same as the real data distribution. The ability of GANs to generate
high-quality images makes them applicable in several image processing applications. An encoder-decoder
structure is used for training in reconstructing images; such a model can produce various realistic
representations of the input by altering attribute values. A conditional adversarial network [32] can be
used for image translation, which converts an image from one representation to another, such as day to
night.
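
The left-right consistency check mentioned above can be illustrated on a single scanline. This is a minimal sketch and not the procedure used in this paper: it assumes integer disparities, assumes that left pixel x matches right pixel x - d, and uses a hypothetical tolerance parameter `tol`.

```python
def left_right_check(disp_left, disp_right, tol=1):
    """Flag pixels of one scanline whose left and right disparities disagree.

    disp_left[x] maps left pixel x to right pixel x - disp_left[x]. A match is
    consistent when the right disparity at that position agrees with the left
    disparity within `tol` pixels. Returns a list of booleans (True = keep).
    """
    width = len(disp_left)
    valid = []
    for x, d in enumerate(disp_left):
        xr = x - d
        if 0 <= xr < width:
            valid.append(abs(d - disp_right[xr]) <= tol)
        else:
            valid.append(False)  # the match falls outside the right image
    return valid


left = [0, 1, 1, 3, 2]
right = [1, 1, 2, 2, 2]
print(left_right_check(left, right))  # [True, True, True, False, True]
```

Pixels flagged False are the outliers, typically occlusions or mismatches, that a refinement stage then has to fill.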

We propose a hybrid CNN-based deep stereo network (CDSN) model to estimate an accurate disparity
map. Loopy belief propagation computes the initial disparity map from features extracted by a CNN, and a
generative adversarial network handles the ill-posed regions in that map. The generated images look more
realistic and closer to the ground truth disparity map. The results show that the proposed CDSN model
effectively handles ill-posed regions such as discontinuities at image boundaries and occluded areas. The
proposed model outperforms the compared existing techniques on the Middlebury dataset [33]. The paper
is organized as follows: section 2 explains the proposed CDSN model, section 3 presents the results, and
section 4 presents the conclusions.


2. METHOD 

A CNN-based model is proposed for stereo matching to find the disparity map. The features extracted
by the CNN are used to compute the unary cost and the smoothness cost. A global energy function is adopted
to obtain the initial disparity map, and a GAN model is used to handle ill-posed regions. Table 1 lists the
symbols with their descriptions. The flowchart of the proposed model is shown in Figure 1.


Table 1. Symbol table

Symbol                      Description
$D(d_i)$                    Unary cost at pixel $i$
$V^{l}$                     Left feature vector
$V^{r}$                     Right feature vector
$S(d_i, d_j)$               Smoothness cost
$\alpha$ and $\beta$        Smoothness constants
$msg^{t}_{i \to j}(d_j)$    Message from pixel $i$ to $j$ at iteration $t$
$D(p, q)$                   Discriminator
$E_{p,q}$                   Expected value over all real data instances
$G(p, r)$                   Generator
$E_{p,r}$                   Expected value over all generated instances
$D_g$                       Ground truth disparity map
$D_e$                       Estimated disparity map
$T$                         Threshold value
$N$                         Total number of pixels


[Flowchart; recoverable box labels: left image, right image, extract CNN features (one box per image), "... using GAN"]

Figure 1. Flowchart of the CDSN model

2.1. CNN feature extraction

Conventional stereo matching algorithms rely on hand-crafted features, which capture inadequate
image information. CNNs are used for various vision problems, including stereo matching; because a CNN
extracts local context better, it is robust to photometric differences. The feature descriptors are extracted
from the rectified stereo images using a pre-trained visual geometry group (VGG-16) model [34]. The
VGG-16 model is pre-trained on the ImageNet dataset, which contains about 14 million labelled
high-resolution images, and classifies images into 1,000 classes. The output of the 9th layer is used for
stereo matching in the proposed model, as it provides an appropriate feature space for computing disparity.
VGG-16 uses max pooling layers that select the maximum element of the input map within a 2x2 filter. The
first and second layers have 64 channels with 3x3 kernels and are followed by max pooling with stride (2, 2).
The third and fourth layers have 128 channels with 3x3 kernels, followed by max pooling with stride (2, 2).
The next three layers have 256 channels with 3x3 kernels, followed by max pooling with stride (2, 2).
The eighth and ninth layers have 512 channels with 3x3 kernels. An N-dimensional feature vector is obtained
for every pixel location.
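
The layer stack described above can be traced to see what the 9th-layer output looks like. This is a small sketch in plain Python, assuming 3x3 convolutions with padding 1 (spatial size unchanged) and 2x2 max pooling with stride 2, as in the standard VGG-16 definition. The names `VGG16_PREFIX` and `feature_shape` are illustrative.

```python
# (layer type, output channels) for the first nine convolutional layers of
# VGG-16 and the three pooling stages between them, as described in the text.
VGG16_PREFIX = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512),
]


def feature_shape(height, width, layers=VGG16_PREFIX):
    """Return (height, width, channels) of the feature map after `layers`.

    Assumes padded 3x3 convolutions (size unchanged) and 2x2 max pooling
    with stride 2 (size halved, rounded down).
    """
    channels = 3  # RGB input
    for kind, out_channels in layers:
        if kind == "conv":
            channels = out_channels
        else:
            height, width = height // 2, width // 2
    return height, width, channels


print(feature_shape(256, 256))  # (32, 32, 512)
```

Under these assumptions each feature vector has 512 dimensions and the feature map is at one eighth of the input resolution; how the features are related back to individual pixels is not specified in this section.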


2.2. Initial disparity map estimation 

The extracted feature descriptors are used to determine the matching cost of every pixel in the left
feature map. For each such pixel, we search horizontally along the right feature map for the best matching
value. The unary matching cost is the Euclidean distance between the two feature descriptors, computed
using (1).


$$D(d_i) = \min \lVert V^{l} - V^{r} \rVert \tag{1}$$
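
The horizontal search behind (1) can be sketched for one scanline. This is an illustration under stated assumptions rather than the authors' implementation: the candidate match for left column x at disparity d is taken to be right column x - d, and the feature vectors are toy 2-D tuples instead of CNN descriptors. The function name `unary_costs` is hypothetical.

```python
import math


def unary_costs(left_feats, right_feats, x, max_disp):
    """Euclidean feature distance for each candidate disparity at column x.

    left_feats and right_feats hold one feature vector per column of a single
    scanline. Candidates whose match x - d falls outside the image get an
    infinite cost.
    """
    costs = []
    for d in range(max_disp + 1):
        xr = x - d
        if xr < 0:
            costs.append(math.inf)
        else:
            costs.append(math.dist(left_feats[x], right_feats[xr]))
    return costs


# Toy features: the right scanline is the left one shifted by one column.
left = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
right = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
costs = unary_costs(left, right, x=2, max_disp=2)
best = min(range(len(costs)), key=costs.__getitem__)
print(costs, best)  # the cost is zero at disparity 1
```

Keeping the whole cost vector, rather than only its minimum, is what allows a later smoothing step to overrule a locally best but inconsistent disparity.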


The unary cost alone may not yield an optimal result in textureless regions, repetitive patterns, and
discontinuities. A smoothness cost is therefore used to smooth the unary cost. Many smoothing techniques
have been proposed in the recent past. Most of these methods use a random variable to hold the disparity of
a pixel and encode the smoothness cost based on some standard constant. The smoothness cost is estimated
from neighbouring pixel information and penalizes inconsistent disparity values. It is computed using (2),


$$S(d_i, d_j) = \frac{\alpha \, (d_i - d_j)^{2}}{[\ldots]} \tag{2}$$


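Let $P$ represent the pixels in the image. The initial disparity $d_i$ of each pixel $i \in P$ is estimated
using the energy function $E$,

$$E(d) = \sum_{i \in P} D(d_i) + \sum_{(i,j) \in \mathcal{N}} S(d_i, d_j) \tag{3}$$

where $\mathcal{N}$ denotes the set of neighbouring pixel pairs.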


The proposed method uses the max-product variant of loopy belief propagation (LBP) [20] to obtain
the best disparity map. LBP assigns a label to each pixel while imposing global constraints through
message passing. It is an iterative method in which messages are passed to the left, right, top, and
bottom neighbours in each iteration. In iteration $t$, the message from pixel $i$ to pixel $j$ is computed
using (4),


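$$msg^{t}_{i \to j}(d_j) = \min_{d_i} \Big[ D(d_i) + S(d_i, d_j) + \sum_{a \in \mathcal{N}(i) \setminus j} msg^{t-1}_{a \to i}(d_i) \Big] \tag{4}$$

Here $a$ ranges over all neighbours of $i$ except $j$. The belief is calculated by (5),

$$Belief(d_i) = D(d_i) + \sum_{k \in \mathcal{N}(i)} msg_{k \to i}(d_i) \tag{5}$$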


The value of $d$ ranges from 0 to the maximum disparity, and $k$ ranges over the neighbours of pixel $i$.
The smooth disparity is the label that minimizes $Belief(d_i)$ after $T$ iterations. We observed that the
energy stopped decreasing after 10 iterations; hence the proposed algorithm uses 10 iterations.
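
The message update and belief computation described above can be sketched in plain Python on a tiny grid. This is a minimal illustration, not the authors' implementation. It works with costs, so max-product appears in its equivalent min-sum form. The pairwise cost `truncated_quadratic` is a stand-in chosen for the example and is not the smoothness cost of (2), and the subtraction of the minimum from each message is a common stabilising step that the text does not mention.

```python
def loopy_bp(unary, smooth, iterations=10):
    """Min-sum loopy belief propagation on a 4-connected grid.

    unary[y][x][d] is the data cost of label d at pixel (y, x) and
    smooth(di, dj) is the pairwise cost. Returns the label map that
    minimises the belief at every pixel after `iterations` sweeps.
    """
    h, w, n = len(unary), len(unary[0]), len(unary[0][0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def neighbours(y, x):
        return [(y + dy, x + dx) for dy, dx in steps
                if 0 <= y + dy < h and 0 <= x + dx < w]

    # msgs[(src, dst)][d] = message from pixel src to pixel dst about label d.
    msgs = {((y, x), q): [0.0] * n
            for y in range(h) for x in range(w) for q in neighbours(y, x)}

    for _ in range(iterations):
        new = {}
        for (i, j) in msgs:
            # Data cost at i plus all incoming messages except the one from j.
            base = [unary[i[0]][i[1]][di]
                    + sum(msgs[(a, i)][di] for a in neighbours(*i) if a != j)
                    for di in range(n)]
            out = [min(base[di] + smooth(di, dj) for di in range(n))
                   for dj in range(n)]
            low = min(out)  # normalise so messages do not grow without bound
            new[(i, j)] = [v - low for v in out]
        msgs = new

    labels = []
    for y in range(h):
        row = []
        for x in range(w):
            belief = [unary[y][x][d]
                      + sum(msgs[(k, (y, x))][d] for k in neighbours(y, x))
                      for d in range(n)]
            row.append(min(range(n), key=belief.__getitem__))
        labels.append(row)
    return labels


def truncated_quadratic(di, dj):
    """Illustrative pairwise cost (an assumption, not equation (2))."""
    return 0.5 * min((di - dj) ** 2, 2.0)


# 1x3 image, two labels: the middle pixel weakly prefers label 1 on its own,
# but its neighbours pull it to label 0.
unary = [[[0.0, 2.0], [0.6, 0.4], [0.0, 2.0]]]
print(loopy_bp(unary, truncated_quadratic))  # [[0, 0, 0]]
```

All messages are updated from the previous iteration's values, which matches the t-1 superscript in the update rule; a full-resolution implementation would vectorise these loops.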


2.3. Disparity refinement using GAN 

A GAN is used to refine the disparity and to handle ill-posed regions. A GAN learns automatically by
identifying patterns or irregularities in the input data, and it can handle missing data such as occluded
pixels in the disparity map. The two sub-models of a GAN are the generator and the discriminator: the
generator produces new samples, and the discriminator checks whether the generated samples are similar to
the ground truth map. In the proposed model, the network learns from the ground truth. The architecture of
this refinement technique is given in Figure 2.


[Block diagram; recoverable labels: initial disparity map, generated image, ground truth image,
discriminator, decision "close to ground truth?" with a "no" branch, final disparity map]

Figure 2. Architecture of disparity refinement technique


The proposed model uses the Pix2Pix GAN [32], which is simple and produces high-quality images in
image translation applications. Its efficiency compared with other GANs such as CycleGAN [35] and
DualGAN [36] is examined in the ablation study. The generator in Pix2Pix is a convolutional network that
accepts the initial disparity map as its input image and passes it through several convolution and
up-sampling layers. It produces a refined disparity map in which the occluded areas are filled with valid
data. The U-Net auto-encoding generator is trained with an adversarial loss that encourages it to create
plausible images. The encoder and decoder are made up of blocks of convolutional, batch normalization,
and activation layers. The generator is also updated by a loss computed between the generated image and
the ground truth image, which helps it create images that are closer to the ground truth. The generator G
is trained to produce output that a discriminator D cannot distinguish from the ground truth image. The
GAN objective is represented as,


$$L_{GAN}(G, D) = E_{p,q}[\log D(p, q)] + E_{p,r}[\log(1 - D(p, G(p, r)))] \tag{6}$$

Here $p$ denotes a ground truth image, $q$ represents the generated image, and $r$ represents the initial
disparity map.

$$G^{*} = \arg\min_{G} \max_{D} L_{GAN}(G, D) \tag{7}$$

$G$ aims to decrease the objective and $D$ aims to increase it. The generator $G$ also tries to move the
generated image closer to the ground truth image using the $L_1$ loss, which is calculated as

$$Loss_{L_1}(G) = E_{p,q,r}[\lVert q - G(p, r) \rVert_{1}] \tag{8}$$

The final objective is represented as

$$G^{*} = \arg\min_{G} \max_{D} L_{GAN}(G, D) + \lambda \, Loss_{L_1}(G) \tag{9}$$

Visual artifacts were reduced for $\lambda = 100$.
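
The way the adversarial term and the weighted L1 term combine on the generator side can be shown numerically. This is a hedged sketch in plain Python, not the training code of this paper: it assumes the discriminator outputs probabilities, takes the L1 term as the mean absolute difference between the generated map and the ground truth map, and uses illustrative names (`generator_objective`, `d_fake`). Practical Pix2Pix implementations often replace the log(1 - D) term with a non-saturating variant.

```python
import math


def generator_objective(d_fake, generated, reference, lam=100.0):
    """Generator-side value of the combined objective for one batch.

    d_fake    : discriminator probabilities for the generated maps
    generated : flat list of generated disparity values
    reference : flat list of ground truth disparity values
    Returns (adversarial term, L1 term, adversarial + lam * L1).
    """
    eps = 1e-12  # guards against log(0)
    adversarial = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    l1 = sum(abs(t - g) for t, g in zip(reference, generated)) / len(reference)
    return adversarial, l1, adversarial + lam * l1


adv, l1, total = generator_objective(
    d_fake=[0.5, 0.5],
    generated=[1.0, 2.0, 3.0, 4.0],
    reference=[1.0, 2.5, 3.0, 3.5],
)
print(round(adv, 4), l1, round(total, 4))  # -0.6931 0.25 24.3069
```

With a weight of 100 the L1 term dominates the total, so the generator is pushed mainly toward the ground truth while the adversarial term shapes the finer structure.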

The network is trained on images from the Middlebury dataset [33]. It was evaluated after 100, 200, 300,
and 400 epochs, and the best disparity map was achieved at 300 epochs. The output of the generator is
fed to the discriminator together with the ground truth image, and the gradients of the loss with respect to
the generator and the discriminator are used to update the model. The trained model is then tested to yield
the best disparity map. The results show that the best disparity map was obtained by handling the ill-posed
regions. Figure 3 shows the performance of the model: Figure 3(a) depicts the training loss and Figure 3(b)
the training accuracy against the number of epochs. The lower the loss, the better the accuracy.

To measure the efficacy of the proposed model, we deployed and tested it on a machine with dual Intel
Xeon E5-2609 v4 processors (8 cores, 1.7 GHz, 20 MB cache, 6.4 GT/s), 128 GB of memory, and dual NVIDIA
Tesla P100 graphics processing units (GPUs) with 3584 cores and a maximum of 18.7 TeraFLOPS. The
proposed CDSN model is evaluated on Middlebury dataset images, which are pre-processed and rectified
stereo images. The output of the 9th layer of the pre-trained VGG-16 architecture is used to estimate the
initial disparity map with loopy belief propagation. The initial disparity estimation is implemented in
Python, and the GAN is implemented in PyTorch. The Adam optimizer is used to train the Pix2Pix GAN for
300 epochs to handle the ill-posed regions, with the learning rate initialised to 0.0002. The complexity of
the GAN model is summarized in Table 2.

Figure 3. Performance of the model: (a) training loss versus number of epochs and (b) training accuracy
versus number of epochs


Table 2. Complexity of GAN model
Input size   Optimizer   Parameters   Epochs   Output size   GPU memory   GPU model
256x256x3    Adam        54.414M      300      256x256x3     128GB        Dual NVIDIA Tesla P100


3. RESULTS AND DISCUSSION 

The proposed model is analyzed on images taken from the Middlebury dataset, namely “Jade
plant”, “Piano”, “Pipes”, and “Recycle”. The test images and their resolutions are listed in Table 3. The
Middlebury 2014 dataset contains 33 scenes that are divided into training, additional, and test images.
Certain images are used more than once under various exposures. Very high-resolution images are the
salient feature of the dataset. Ground truth maps and images are provided at quarter, half, and full
resolution.


Table 3. Images from Middlebury 2014

Images        Image resolution
Jade plant    659x497
Piano         707x481
Pipes         735x485
Recycle       720x486


3.1. Qualitative comparison 

The qualitative results for disparity map estimation are depicted in Figure 4. From top to bottom, the
rows show Jade plant, Piano, Pipes, and Recycle. Figure 4(a) shows the left image, Figure 4(b) the right
image, Figure 4(c) the ground truth image, and Figure 4(d) the estimated disparity map.


3.2. Quantitative comparison 

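The percentage of bad matching pixels (PBMP) and the root mean square error (RMSE) were
used for quantitative analysis. Lower values of PBMP and RMSE indicate better performance. PBMP is
calculated as,

$$PBMP = \frac{1}{N} \sum_{(x,y)} \big[\, |d_e(x, y) - d_g(x, y)| > T \,\big] \times 100 \tag{10}$$

RMSE is calculated as,

$$RMSE = \Big[ \frac{1}{N} \sum_{(x,y)} |d_e(x, y) - d_g(x, y)|^{2} \Big]^{1/2} \tag{11}$$

where $d_e$ and $d_g$ are the estimated and ground truth disparities, $T$ is the threshold value, and $N$ is
the total number of pixels.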
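
Both metrics are straightforward to compute once the disparity maps are flattened. This is a minimal sketch in plain Python, assuming that every pixel has a valid ground truth value (no masking) and that the maps are passed as flat lists; the function names are illustrative.

```python
import math


def pbmp(estimated, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds threshold."""
    bad = sum(abs(e - g) > threshold for e, g in zip(estimated, ground_truth))
    return 100.0 * bad / len(ground_truth)


def rmse(estimated, ground_truth):
    """Root mean square disparity error over all pixels."""
    squared = sum((e - g) ** 2 for e, g in zip(estimated, ground_truth))
    return math.sqrt(squared / len(ground_truth))


est = [10.0, 12.0, 15.0, 20.0]
gt = [10.0, 14.0, 15.5, 20.0]
print(pbmp(est, gt))            # 25.0
print(round(rmse(est, gt), 4))  # 1.0308
```

PBMP counts how many pixels are wrong by more than the threshold regardless of how wrong they are, whereas RMSE is dominated by the largest errors, so the two metrics can rank methods differently.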


For evaluation, we compared the CDSN model with existing stereo matching models: DeepPruner [30],
ACR-GIF-OW [17], and efficient stereo matching by log-angle and pyramid-tree (LPSM) [21]. Occluded
areas are not dealt with efficiently in [30]. The method in [17] is computationally inexpensive but does not
produce accurate results in textureless and discontinuous areas. The method in [21] relies on hand-crafted
matching costs, and hence its results are not accurate. The Middlebury evaluation leaderboard results of
the existing methods are used for comparison. The PBMP and RMSE results of the proposed model and the
existing techniques are shown in Table 4 and Table 5, respectively. The PBMP of the proposed CDSN model
is lower than that of all three compared methods on every image, and its average RMSE is also the lowest,
although DeepPruner and LPSM achieve lower RMSE on Piano and Recycle. Hence the proposed model
outperforms the compared methods overall and is suitable for disparity map estimation.


Figure 4. Visual results on Middlebury images (a) left image, (b) right image, (c) ground truth image and 
(d) estimated disparity map 


Table 4. Quantitative results based on PBMP for error threshold = 1 between computed and ground truth
disparities
Method             Jade plant   Piano   Pipes   Recycle
DeepPruner [30]    62.8         41.0    53.8    36.8
ACR-GIF-OW [17]    51.8         45.1    40.5    37.5
LPSM [21]          59.2         44.8    46.3    36.8
CDSN               50.76        39.71   40.47   33.57


Table 5. Quantitative results based on RMSE between computed and ground truth disparities
Method             Jade plant   Piano   Pipes   Recycle   Average
DeepPruner [30]    28.2         4.64    13.7    3.81      12.58
ACR-GIF-OW [17]    64.9         14.8    28.6    15.8      31.02
LPSM [21]          34.8         6.09    16.3    5.79      15.74
CDSN               6.07         8.73    8.88    8.48      8.04


3.3. Ablation study 

We conducted an ablation study comparing the proposed model with CycleGAN and DualGAN.
CycleGAN performs image translation without paired examples, using unsupervised training. DualGAN is
made up of two generators and two discriminators and is trained to translate images from source to target
and from target to source. The metrics used are absolute relative distance (ARD), squared relative
difference (SRD), and RMSE; lower values indicate better performance. The proposed model achieves the
lowest value on all three metrics, as presented in Table 6.

Table 6. Ablation study using metrics ARD, SRD, and RMSE (lower is better)
Metric   CycleGAN   DualGAN   Proposed model
ARD      0.032      0.035     0.016
SRD      0.352      0.374     0.337
RMSE     7.036      7.974     6.591
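
The text does not define ARD and SRD. The sketch below assumes the definitions commonly used in the depth estimation literature, namely the mean of |e - g| / g and the mean of (e - g)^2 / g over all pixels, with strictly positive ground truth values g. The function names are illustrative.

```python
def abs_rel(estimated, ground_truth):
    """Mean absolute error relative to the ground truth value."""
    terms = [abs(e - g) / g for e, g in zip(estimated, ground_truth)]
    return sum(terms) / len(terms)


def sq_rel(estimated, ground_truth):
    """Mean squared error relative to the ground truth value."""
    terms = [(e - g) ** 2 / g for e, g in zip(estimated, ground_truth)]
    return sum(terms) / len(terms)


est = [9.0, 22.0]
gt = [10.0, 20.0]
print(round(abs_rel(est, gt), 3))  # 0.1
print(round(sq_rel(est, gt), 3))   # 0.15
```

Both metrics normalise by the ground truth, so an error of one pixel counts more where the true disparity is small.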


4. CONCLUSION 

This paper presents a novel CNN-based model for stereo matching that estimates the disparity map from
rectified stereo images, which is useful in autonomous vehicles. The features extracted by the CNN are used
to compute the unary cost and the smoothness cost. The initial disparity map is obtained using loopy belief
propagation and is then refined using a GAN model to handle the ill-posed regions. The proposed CNN-based
model generates disparity maps that are smoother than those generated using a naive model, and the
ill-posed regions are handled well by the GAN. The proposed model is evaluated qualitatively as well as
quantitatively on various images from the Middlebury stereo dataset. The results show that the proposed
model achieves the best disparity map and outperforms the compared existing methods.